In today’s lecture we are going to learn how to make use of a number of R packages that allow you to directly query and access data from some notable environmental databases. These data access packages are incredibly powerful as they allow you to streamline your workflow by combining the data identification and acquisition steps directly with your data analysis. They also provide a highly efficient and reproducible approach to data identification and acquisition - allowing you to easily adapt or expand your research question with minimal additional effort.
There are many packages that have been developed for accessing data through R, including a wide range of packages for accessing environmental datasets. Today we will look at a few notable packages and in the coming weeks we will explore some additional ones. The packages below highlight a number of notable research-quality datasets that are likely to be useful in many of your class projects, theses, and future research/work.
Let’s load in a few packages that we will use today. Many of you may not yet have installed the leaflet package; if so, you should go ahead and do so before proceeding. FYI, the leaflet package allows you to make interactive maps. You will learn how to do this in upcoming lectures; however, a few of the packages used today automatically generate leaflet maps, so you will need this package for those maps to be generated.
library(tidyverse)
library(lubridate)
library(leaflet)

In much of our environmental research we need to use climate/meteorological data. While there are a host of organizations/agencies that provide freely available research-quality datasets, locating and accessing the data can often be time-consuming and difficult. Furthermore, once you’ve located the data (generally through the website of the agency/organization collecting the data) you typically need to download the file (e.g., csv, Excel), load it into R, and reformat the data for your analysis. The R packages below allow you to quickly locate the data of interest and load it directly into R in a format that is ready to use - typically all within a single function call!
Let’s learn how to access hourly meteorological data from NOAA using the worldmet package. The worldmet package allows us to efficiently locate and access meteorological data (e.g., air temperature, precipitation, wind speed, barometric pressure) from thousands of meteorological stations around the globe. The data accessed using the worldmet package comes directly from the National Oceanic and Atmospheric Administration (NOAA) Global Historical Climatology Network (GHCN). The GHCN has data available for many sites at hourly intervals and for even more sites at daily intervals.
The data in this database is high-quality and standardized (e.g., data across all locations is reported in the same units, with the same formatting, etc.), which allows for relatively easy use of this data in research/analysis.
Let’s load in the worldmet package. If you haven’t yet installed worldmet you will need to go ahead and do so first.
library(worldmet)

The worldmet package is pretty straightforward to use - you simply need to specify the station you would like to access and the year(s) you would like to download the data for.
We can use the import_ghcn_stations() function to obtain a data frame containing information (e.g., station ID, name, lat/long,…) about each of the stations that have data available for download. If you are interested in sites that have meteorological data measured hourly, then you should specify database = "hourly" when you call the function below.
worldmet_sites_hourly_df <- import_ghcn_stations(database = "hourly")

Take a look at the data frame you just obtained and you will see that the id column contains the station ID that is unique to each site. The data frame also has a bunch of additional useful information about each site, including the time period over which data is available.
When using the import_ghcn_stations() function you can set the return argument to return = "map" and an interactive map of all of the available stations near your area of interest will appear. Let’s make a map of all of the available sites and use the lat and lng arguments to have the map initially centered over Schenectady.
import_ghcn_stations(lat = 42.6, lng = -73.8, return = "map", database = "hourly")

FYI, the map defaults to the 10 nearest stations, however you can change this by setting the n_max argument to a number of your choosing.
Ok, now we can go ahead and use the import_ghcn_hourly() function to download some hourly meteorological data. This function requires two arguments. The station argument is where you specify the station ID. The year argument is where you specify the year(s) you would like to download. To specify multiple years you would set the year argument as follows: year = 2010:2022, which in that case would download hourly data from 2010 through 2022.
Let’s go ahead and download hourly data for Albany International Airport (station ID: USW00014735) for 2025 through the most recent reported measurement (~ 1 or 2 days prior to today).
worldmet_data_hourly <- import_ghcn_hourly(station = "USW00014735",
                                           year = 2025:2026)

Descriptions of all the variables and their units are available here.
Important: All data are reported in the Greenwich Mean Time (GMT) time zone. This means that you may want to convert the time data to the station’s local time zone.
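Since the timestamps come in GMT, the conversion can be done with lubridate’s with_tz() function (lubridate was loaded at the start of class). Below is a minimal sketch; the America/New_York time zone and the example timestamp are just for illustration.

```r
library(lubridate)

# with_tz() changes the time zone used to *display* an instant; the underlying
# moment in time is unchanged.
gmt_time <- ymd_hm("2025-06-15 18:00", tz = "GMT")
local_time <- with_tz(gmt_time, tzone = "America/New_York")
format(local_time, "%Y-%m-%d %H:%M %Z")  # "2025-06-15 14:00 EDT"

# For a downloaded data frame you might convert the whole date column, e.g.:
# worldmet_data_hourly %>% mutate(date = with_tz(date, tzone = "America/New_York"))
```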
Let’s quickly plot some of the data to see how it looks.
Hourly temperature data
worldmet_data_hourly %>%
ggplot(aes(x = date, y = air_temp)) +
geom_line() +
theme_bw()

Hourly relative humidity data
worldmet_data_hourly %>%
ggplot(aes(x = date, y = rh)) +
geom_line() +
theme_bw()

We can also use the worldmet package to obtain daily time-interval meteorological data from NOAA’s GHCN daily dataset.
The approach is nearly identical to what we did above for the hourly data.
Let’s first get a database of the stations that have daily data available.
worldmet_sites_daily_df <- import_ghcn_stations(database = "daily")

Now let’s obtain data for the city of Zurich, Switzerland (station ID: SZ000003700). Since we want daily data we will use the import_ghcn_daily() function.
worldmet_data_daily <- import_ghcn_daily(station = "SZ000003700",
                                         year = 2025:2026)

Take a look at the data frame and familiarize yourself with the data.
Descriptions of the variables are available here.
Once you’ve familiarized yourself with the data go ahead and make the figure below showing the daily min and max air temperature from 2025 through to the most recently reported data.
worldmet_data_daily %>%
ggplot() +
geom_line(aes(x = date, y = tmin), color = "blue") +
geom_line(aes(x = date, y = tmax), color = "red") +
labs(title = worldmet_data_daily$station_name[1],
x = "Date",
y = "Temperature (C)",
caption = paste("Data source GHCNd station", worldmet_data_daily$station_id[1])
) +
theme_classic()

dataRetrieval package

The USGS National Water Information System (NWIS) is an incredible database containing physical and chemical data on the streams/rivers, lakes, and groundwater for nearly 2 million locations across the US and US territories. From NWIS you can download data related to thousands of different parameters including streamflow, groundwater levels, and a wide range of water chemistry and physical conditions.
This data can be queried and retrieved using the dataRetrieval package that was developed and is maintained by the USGS. Given the richness and volume of the data, there are many functions and features within this package, and some knowledge of how the available data is organized is particularly helpful when using dataRetrieval. Today we’ll learn how to download daily streamflow data and also learn the basics of downloading water quality data. For those interested in learning more, I am happy to provide additional guidance, and you can also explore the excellent background and tutorial articles that have been put together by the developers of the package.
When querying the NWIS database to find and/or download available data you generally need to specify a parameter (i.e., what was being measured) and a location/region of interest.
If you already know the USGS site number (the unique numeric identifier the USGS assigns to sampling sites) and you know that the parameter of interest was measured at the site, you can proceed to the data acquisition step. However, many times we don’t know whether sites with the desired data exist and we need to first identify those sites. Once you’ve identified the sites with the desired data, you can then directly acquire the data for the site(s) of interest.
Let’s work through an example where we locate sites in Schenectady county that have mean daily streamflow and/or gauge height (i.e., stream water level) data. Note that we could look for other parameters (e.g., turbidity, concentration of chloride, pH,…) if we were interested in those measurements.
First let’s load in the dataRetrieval package. If you have never used this package before you will need to install it before proceeding.
library(dataRetrieval)

We are going to query the NWIS database for sites within Schenectady county (countyCd = "36093") that have streamflow (00060) and/or gauge height (i.e., water level) (00065) data.
Note that every US county has a FIPS county code, which is a unique numeric code that the US has assigned to each county. You can find a list of all the county codes here.
You can also find a list of all of the parameter codes used by the USGS here.
You’ll also note that we specified service = "dv"; this tells the function to query for sites that have daily data reported.
sites_schdy <- whatNWISsites(countyCd = "36093",
                             parameterCd = c("00060","00065"),
                             service = "dv"
                             )

Warning in whatNWISsites(countyCd = "36093", parameterCd = c("00060", "00065"),
: NWIS servers are slated for decommission. Please begin to migrate to
read_waterdata_monitoring_location
GET: https://waterservices.usgs.gov/nwis/site/?countyCd=36093¶meterCd=00060%2C00065&hasDataTypeCd=dv&format=mapper
sites_schdy

You can see that there are 8 sites that meet the search criteria. You’ll also note that under the site_tp_cd column you see ST, which stands for stream. If you queried for a parameter (e.g., water temperature) that could also apply to other site types, then you would likely see other site_tp_cd values listed too (e.g., GW).
Now let’s get information about the available data (e.g., the number of observations, the length of the data record).
sites_what_data <- whatNWISdata(siteNumber = sites_schdy$site_no,
                                service = "dv",
                                parameterCd = c("00060","00065"),
                                statCd = "00003")

Warning in whatNWISdata(siteNumber = sites_schdy$site_no, service = "dv", :
NWIS servers are slated for decommission. Please begin to migrate to
read_waterdata_ts_meta
GET: https://waterservices.usgs.gov/nwis/site/?seriesCatalogOutput=true&sites=01351500,01351450,01354230,01354330,01354500,01356190
sites_what_data

In the above function call you see that we set the argument statCd = "00003". This specifies the statistic to download. The code 00003 means to obtain the average value. We need to specify this since we are downloading daily values, and since most sites take sub-hourly readings we need to tell the function what type of daily value to return (e.g., minimum measurement, maximum, average,…). If you want to obtain a different statistic than the average you can simply change the statistic code.
We can see that we have information about the data records for the sites with available gauge height (parm_cd = 00065) and streamflow (parm_cd = 00060). Below is a map of these sites.
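A map like this can be made with the leaflet package we loaded at the start of class. The sketch below assumes the standard NWIS metadata columns dec_lat_va, dec_long_va, and station_nm returned by whatNWISdata(); treat it as a starting point rather than the exact code used to produce the map.

```r
# Sketch: interactive map of the sites returned by whatNWISdata().
# Assumes the standard NWIS columns dec_lat_va / dec_long_va (decimal
# latitude/longitude) and station_nm (station name).
sites_what_data %>%
  distinct(site_no, station_nm, dec_lat_va, dec_long_va) %>%
  leaflet() %>%
  addTiles() %>%  # add the default base map tiles
  addMarkers(lng = ~dec_long_va, lat = ~dec_lat_va, popup = ~station_nm)
```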
Based on this information you might choose to acquire the data for some or all of these sites. Since there are only a few sites, let’s go ahead and download the water gauge height and streamflow data for all of these sites.
df_stream_data <- readNWISdv(siteNumbers = sites_what_data$site_no,
                             parameterCd = c("00060","00065"),
                             statCd = "00003")

Warning in readNWISdv(siteNumbers = sites_what_data$site_no, parameterCd =
c("00060", : NWIS servers are slated for decommission. Please begin to migrate
to read_waterdata_daily.
GET: https://waterservices.usgs.gov/nwis/dv/?site=01351450%2C01351500%2C01354230%2C01354330%2C01354500%2C01356190&format=waterml%2C1.1&ParameterCd=00060%2C00065&StatCd=00003&startDT=1851-01-01
When you download the data it will have column names that refer to parameter codes and other data descriptors that are not particularly human-readable. Let’s use the renameNWISColumns() function, which renames the columns with more human-readable names.
df_stream_data <- df_stream_data %>%
renameNWISColumns()

Now let’s plot a time-series of the gauge heights for each of the sites.
df_stream_data %>%
  filter(!is.na(GH)) %>% # remove NA gauge heights
  ggplot(aes(x = Date, y = GH)) +
  geom_line() +
  facet_wrap(~ site_no, scales = "free")

Now let’s plot a time-series of the streamflow for each of the sites.
df_stream_data %>%
  filter(!is.na(Flow)) %>%
  ggplot(aes(x = Date, y = Flow)) +
  geom_line() +
  facet_wrap(~ site_no, scales = "free")

The NWIS database has over 4.4 million water quality measurements across hundreds of thousands of locations throughout the US and US territories. We can use the dataRetrieval package to query and access this data. Furthermore, the dataRetrieval package also allows us to access water quality data from the Environmental Protection Agency (EPA) and the US Department of Agriculture (USDA). As noted earlier, the dataRetrieval package and understanding NWIS data can take a bit of time to learn, though we’ll work through an example to highlight just how powerful dataRetrieval is for obtaining research-grade water quality measurements.
The key first step is to specify what water quality parameter(s) you are interested in obtaining. You can find the list of the available USGS parameter codes here.
You can also obtain a data frame of the USGS, EPA, and USDA parameter codes directly in R, by calling parameterCdFile. Let’s do that now.
parm_cd_df <- parameterCdFile

Take a look at this data frame. You’ll see that there are more than 24,000 parameter codes! You’ll see that each parameter code is assigned to one of 16 different groups (e.g., physical, microbiological, nutrient,…). This makes it easier to locate parameters by their group/type. You’ll also see that the full name and the units associated with each parameter are reported in the data frame.
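To get oriented, it can help to tally the codes by group or to search the parameter names for a keyword. Here is a sketch, assuming the parameter_group_nm, parameter_nm, parameter_cd, and parameter_units column names (run names(parm_cd_df) to confirm them):

```r
# How many parameter codes are in each of the groups?
parm_cd_df %>%
  count(parameter_group_nm, sort = TRUE)

# Search the full parameter names for a keyword (e.g., "nitrate")
parm_cd_df %>%
  filter(str_detect(str_to_lower(parameter_nm), "nitrate")) %>%
  select(parameter_cd, parameter_nm, parameter_units)
```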
Let’s go ahead and obtain data for Nitrate, water, filtered, milligrams per liter as nitrogen, which is parameter code 00618.
Since there are likely tens or hundreds of thousands of nitrate measurements in the NWIS database, we are going to limit our query by location and by date.
Let’s use the whatWQPdata() function to find out what data is available for nitrate (see the parameterCd argument) in New York state (see the stateCd argument). We will limit our query to sites that have at least some measurements that were collected after 1-Jan-2020 (see the startDate argument). Note that the sites can have data collected before that date, however they must have at least one measurement after that date in order to satisfy our search criteria.
Note that we are filtering the data so that we only keep groundwater sites (i.e., wells and springs).
pCode <- c("00618")
NY_NO3_sites <- whatWQPdata(stateCd = "NY",
parameterCd = pCode,
startDate = "2020-01-01"
) %>%
filter(ResolvedMonitoringLocationTypeName %in% c("Well", "Spring"))

GET: https://api.waterdata.usgs.gov/samples-data/codeservice/states?mimeType=application%2Fjson
NEWS: Data does not include USGS data newer than March 11, 2024. More details:
https://doi-usgs.github.io/dataRetrieval/articles/Status.html
GET: https://www.waterqualitydata.us/data/Station/search?statecode=US%3A36&pCode=00618&startDateLo=01-01-2020&count=no&mimeType=geojson
Take a look at the NY_NO3_sites data frame. You’ll see that this data frame has information on all of the available data based on the above query, with each row representing a site with data. The last three columns of this data frame are particularly useful - they contain the date of the first (begin_date) and last (end_date) measurements, and the total number of measurements for that site (count_nu). You’ll see that some sites have multiple measurements, indicating that samples were collected on a number of different dates/times.
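Before downloading, you could use those columns to narrow the list of sites - for example, keeping only sites with several measurements. A sketch using the column names described above (confirm them against your data frame first):

```r
# Keep only sites with at least 5 nitrate measurements, most-sampled first.
# Column names (count_nu, begin_date, end_date) are as described above.
NY_NO3_sites %>%
  filter(count_nu >= 5) %>%
  arrange(desc(count_nu)) %>%
  select(MonitoringLocationIdentifier, begin_date, end_date, count_nu)
```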
Now, we will use the readWQPqw() function to acquire the data for the sites we have identified.
To do this we need to specify the site(s) to obtain. We do this by providing the siteNumbers argument with a vector of site numbers.
We also need to specify the time period for the measurements we would like to obtain. We will limit ourselves to only downloading measurements that were taken from 1-Jan-2020 onward (see the startDate argument).
We also need to specify the parameter to download. Since we are interested in nitrate data we set the parameterCd argument to be equal to the nitrate parameter code of interest (i.e., 00618).
NY_NO3_recent_data <- readWQPqw(siteNumbers = NY_NO3_sites$MonitoringLocationIdentifier,
parameterCd = pCode,
startDate = "2020-01-01"
)

POST: https://www.waterqualitydata.us/data/Result/search?mimeType=csv
NEWS: Data does not include USGS data newer than March 11, 2024. More details:
https://doi-usgs.github.io/dataRetrieval/articles/Status.html
Now we’ve downloaded the data - take a look at the data frame. You’ll see the measurement values in the ResultMeasureValue column.
Recall that some sites have multiple measurements (since measurements were taken on a number of different dates/times). Let’s summarize our data so that we can get the maximum nitrate concentration observed at each of the sites.
To do this we are going to group by MonitoringLocationIdentifier (i.e., group by the different sampling sites/locations) and then summarize the nitrate measurements to take the maximum value observed at each site.
NY_NO3_recent_data_summary <- NY_NO3_recent_data %>%
filter(!is.na(ResultMeasureValue)) %>% # remove rows with NA values for the measurement of interest
group_by(MonitoringLocationIdentifier) %>%
summarize(NO3_max = max(ResultMeasureValue, na.rm = T)) %>%
left_join(attr(NY_NO3_recent_data, "siteInfo")) # joining our water quality data with the data frame that contains site information

Joining with `by = join_by(MonitoringLocationIdentifier)`
Take a look at this data frame to make sure you understand the steps above and what we have done.
Below is a map showing the data that we just summarized.
We will learn how to make these types of maps in upcoming lectures.
Continue to experiment with the packages that we have learned how to use today. Think about how you might apply these packages to your final projects. Some possible things to explore include:
Try obtaining daily or hourly meteorological data for site(s) that you are interested in.
Compare meteorological conditions between sites of interest.
Apply some of the time-series visualization/analysis approaches to the meteorological data.
Use the dataRetrieval package to obtain data for additional parameters and sites from the NWIS database. Spend some time exploring this data.
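As a starting point for the station-comparison idea, here is one sketch that downloads daily data for the two stations used today and overlays their maximum temperatures (the column names match those used in the plots above; swap in stations and years that interest you):

```r
# Compare daily maximum temperature at two GHCN stations.
# Uses the same stations and column names (date, tmax, station_name)
# that appeared earlier in the lecture.
albany <- import_ghcn_daily(station = "USW00014735", year = 2025)
zurich <- import_ghcn_daily(station = "SZ000003700", year = 2025)

bind_rows(albany, zurich) %>%
  ggplot(aes(x = date, y = tmax, color = station_name)) +
  geom_line() +
  labs(x = "Date", y = "Daily max temperature (C)", color = "Station") +
  theme_bw()
```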